home *** CD-ROM | disk | FTP | other *** search
-
-
-
-
-
- AWK - A pattern Scanning and Processing Language
- (Second Edition)
-
- Alfred V. Aho
- Brian W. Kernighan
- Peter J. Weinberger
-
- Bell Laboratories
- Murray Hill, New Jersey 07974
-
-
-
- ABSTRACT
-
- Awk is a programming language whose basic operation is to search a set
- of files for patterns, and to perform specified actions upon lines or fields
- of lines which contain instances of those patterns. Awk makes certain data
- selection and transformation operations easy to express; for example, the
- awk program
-
- length > 72
-
- prints all input lines whose length exceeds 72 characters; the program
-
- NF % 2 == 0
-
- prints all lines with an even number of fields; and the program
-
- { $1 = log($1); print }
-
- replaces the first field of each line by its logarithm.
-
- Awk patterns may include arbitrary boolean combinations of regular
- expresssions and of relational operators on strings, numbers, fields,
- variables and array elements. Actions may include the same pattern-
- matching constructions as in patterns, as well as arithmetic and string
- expressions and assignments, if-else, while, for statements, and multiple
- output streams.
-
- This report contains a user's guide, a discussion of the design and
- implementation of awk, and some timing statistics.
-
- September 1, 1978
-
-
-
-
-
- AWK - A Pattern Scanning and Processing Language
- (Second Edition)
-
- Alfred V. Aho
- Brian W. Kernighan
- Peter J. Weinberger
-
- Bell Laboratories
- Murray Hill, New Jersey 07974
-
-
-
-
-
- 1. Introduction
-
- Awk is a programming language designed to make many common information
- retrieval and text manipulation tasks easy to state and to perform.
-
- The basic operation of awk is to scan a set of input lines in order,
- searching for lines which match any of a set of pastterns which the user
- has specified. For each pattern, an action can be specified; this action
- will be performed on each line that matches the pattern.
-
- Readers familiar with the UNIX[*] program grep[1] will recognize the
- approach, although in awk the patterns may be more general than in grep,
- and the actions allowed are more involved than merely printing the matching
- line. For example the awk program
-
- { print $3, $2 }
-
- prints the third and second columns of a table in that order. The program
-
- $2 ~ /A|B|C/
-
- prints all input lines with an A, B, or C in the second field. The program
-
- $1 != prev { print; prev = $1 }
-
- prints all lines in which the first field is different from the previous
- first field.
-
- 1.1. Usage
-
- The command
-
- awk program [files]
-
- executes the awk commands in the string program on the set of named files,
- or on the standard input if there are no files. The statements can also be
- placed in a file pfile, and executed by the command
-
- awk -f pfile [files]
-
- 1.2. Program Structure
-
- An ask program is a sequence of statements of the form:
-
- pattern { action }
- pattern { action }
- ...
-
- Each line of input is matched against each of the patterns in turn. For
- each pattern that matches the associated action is executed. When all the
- patterns have been tested, the next line is fetched and the matching starts
- over.
-
- Either the pattern or the action may be left out, but not both. If
- there is no action for the pattern, the matching line is simply copied to
- the output. (Thus a line which matches several patterns can be printed
- several times.) If there is no pattern for an action, then the action is
- performed for every input line. A line which matches no pattern is
- ignored.
-
- Since patterns and actions are both optional, actions must be enclosed
- in braces to distinguish them from patterns.
-
- 1.3. Records and Fields
-
- Awk input is divided in to "records" terminated by a record separator.
- The default record separator is a newline, so by default awk processes its
- input a line at a time. The number of the current record is available in a
- variable named NR.
-
- Each input record is concidered to be divided into "fields." Fields
- are normally separated by white space - blanks or tabs - but the input
- field separator may be changed, as described below. Fields are referred to
- as $1, $2, and so forth, where $1 is the first field, and $0 is the whole
- input record itself. Fields may be assigned to. The number of fields in
- the current record is available in a variable named NF.
-
- The variables FS and RS refer to the input field and record
- separators; they may be changed at any time to a single character. The
- optional command line argument -Fc may also be used to set the FS to the
- character c.
-
- If the record separator is empty, an empty input line is taken as the
- record separator, and blanks, tabs and newlines are treated as field
- separators.
-
- The variable FILENAME contains the name of the current input file.
-
- 1.4. Printing
-
- An action may have no pattern, in which case the action is executed
- for all lines. The simplest action is to print some or all of a record;
- this is accomplished by the awk command print. The awk program
-
- { print }
-
- prints each record, thus copying the input to the output intact. More
- useful is to print a field or fields from each record. For instance,
-
- print $2, $1
-
- prints the first two fields in reverse order. Items separated by a comma
- in the print statement will be separated by the current output field
- separator when output. Items not separated by commas will be concatenated,
- so
-
- print $1 $2
-
- runs the first and second fields together.
-
- The predefined variables NF and NR can be used; for example
-
- { print NR, NF, $0 }
-
- prints each record preceded by the record number and the number of fields.
-
- Output may be diverted to multiple files; the program
-
- { print $1 >"foo1"; print $2 >"foo2" }
-
- writes the first field, $1 on the file foo1 and the second field on the
- file foo2. The >> notation can also be used:
-
- print $1 >>"foo"
-
- appends the output to the file foo. (In each case, the output files are
- created if necessary.) The file name can be a variable or a field as well
- as a constant; for example,
-
- print $1 >$2
-
- uses the contents of field 2 as a file name.
-
- Naturally there is a limit on the number of output files; currently it
- is 10.
-
- Similarly, output can be piped into another process (on UNIX only);
- for instance,
-
- print | "mail bwk"
-
- mails the output to bwk.
-
- The variables OFS and ORS may be used to change the current output
- field separator and output record separator. The output record separator
- is appended to the output of the print statement.
-
- Awk also provides the printf statement for ouput formatting:
-
- printf "%8.2f %10ld\n", $1, $2
-
- prints $1 as a floating point number 8 digits wide, with two after the
- decimal point, and $2 as a 10-digit long decimal number, followed by a
- newline. No output separators are produced automatically; you must add
- them yourself, as in this example. The version of printf is identical to
- that used with C[2].
-
- 2. Patterns
-
- A pattern in front of an action acts as a selector that determines
- whether the action is to be executed. A variety of expressions may be used
- as patterns; regular expressions, arithmetic relational expressions,
- string-valued expressions, and arbitrary boolean combinations of these.
-
- 2.1. BEGIN and END
-
- The special pattern BEGIN matches the beginning of the input, before
- the first record is read. The pattern END matches the end of the input,
- after the last record has been processed. BEGIN and END thus provide a way
- to gain control before and after processing, for initialization and wrapup.
-
- As an example, the field separator can be set to a colon by
-
- BEGIN { FS = ":" }
- ... rest of program ...
-
- Or the input lines may be counted by
-
- END { print NR }
-
- If BEGIN is present, it must be the first pattern; END must be the last if
- used.
-
- 2.2. Regular Expressions
-
- The simplest regular expression is a literal string of characters
- enclosed in slashes, like
-
- /smith/
-
- This is actually a complete awk program which will print any occurances of
- the name "smith". If a line contains "smith" as part of a larger word, it
- will also be printed, as in
-
- blacksmithing
-
- Awk regular expressions include the regular expression forms found in
- the UNIX text editor ed[1] and grep (without back referencing). In
- addition, awk allows parentheses for grouping, | for alternatives, + for
- "one or more", and ? for "zero or one", all as in lex[3]. Character
- classes may be abbreviated: [a-zA-Z0-9] is the set of all letters and
- digits. As an example, the awk program
-
- /[Aa]ho|[Ww]einberger|[Kk]ernighan/
-
- will print all lines which contain any of the names "Aho," "Weinberger" or
- "Kernighan," whether capitalized or not.
-
- Regular expressions (with the extensions listed above) must be
- enclosed in slashes, just as in ed and sed[1]. Within a regular
- expression, blanks and the regular expression metacharacters are
- signifigant. To turn off the magic meaning of one of the regular
- expression characters, precede it with a backslash. An example is the
- pattern
-
- /\/.*\//
-
- which matches any string of characters enclosed in slashes.
-
- One can also specify that any field or variable matches a regular
- expression (or does not match it) with the operators ~ and !~. The program
-
- $1 ~ /[jJ]ohn/
-
- prints all lines where the first field matches "john" or "John." Notice
- that this will also match "Johnson", "St. Johnsbury", and so on. To
- restrict it to exactly "John" use
-
- $1 ~ /^[jJ]ohn$/
-
- The caret ^ refers to the beginning of a line or field; the dollar sign $
- refers to the end.
-
- 2.3. Relational Expressions
-
- An awk pattern can be a relational expression involving the usual
- relational operators <, <=, ==, !=, >=, and >. An example is
-
- $2 > $1 + 100
-
- which selects lines where the second field is at least greater than the
- first field. Similarly
-
- NF % 2 == 0
-
- prints all lines with an even number of fields.
-
- In relational tests, if neither operand is numeric, a string
- comparison is made; otherwise it is numeric. Thus
-
- $1 >= "s"
-
- selects lines that begin with an s, t, u, etc. In the absence of any other
- information, fields are treated as strings, so the program
-
- $1 > $2
-
- will perform a string comparison.
-
- 2.4. Combinations of Patterns
-
- A pattern can be any boolean combination of patterns, using the
- operators || (or), && (and), and ! (not). For example,
-
- $1 >= "s" && $1 < "t" && $1 != "smith"
-
- selects lines where the first field begins with "s" but is not "smith". &&
- and || guarantee that their operands will be evaluated from left to right;
- evaluation stops as soon as the truth or falsehood is determined.
-
- 2.5. Pattern Ranges
-
- The "pattern" that selects an action may also consist of two patterns
- separated by commas, as in
-
- pat1, pat2 { ... }
-
- In this case, the action is performed for each line between an occurance of
- pat1 and the next occurance of pat2 (inclusive). For example,
-
- /start/,/stop/
-
- prints all lines between start and stop, while
-
- NR == 100, NR == 200 { ... }
-
- does the action for lines 100 through 200 of the input.
-
- 3. Actions
-
- An awk action is a sequence of action statements terminated by
- newlines or semicolons. These action statements can be used to do a
- variety of bookkeeping and string manipulating tasks.
-
- 3.1. Built-in Functions
-
- Awk provides a "length" function to compute the length of a string of
- characters. This program prints each record, preceded by its length:
-
- { print length, $0 }
-
- length by itself is a "pseudo-variable" which yeilds the length of the
- current record; length(argument) is a function which yeilds the length of
- its argument, as in the equivalent
-
- { print length($0), $0 }
-
- The argument may be any expression.
-
- Awk also provides the arithmetic functions sqrt, log, exp, and int,
- for square root, base e logarithm, exponential, and integer parts of their
- respective arguments.
-
- The name of one of these built-in functions without arguments or
- parentheses, stands for the value of the function on the whole record. The
- program
-
- length < 10 || length > 20
-
- prints lines whose length is less than 10 or more than 20.
-
- The function substr(s, m, n) produces the substring of s that begins
- at position m (origin 1) and is at most n characters long. If n is
- omitted, the substring goes to the end of s. The function index(s1, s2)
- returns the position where the string s2 occurs in s1, or zero if it does
- not.
-
- The function sprintf(f, e1, e2, ...) produces the value of the
- expressions e1, e2, etc, in the printf format specified by f. Thus for
- example,
-
- x = sprintf("%8.2f %10ld", $1, $2)
-
- sets x to the string produced by formatting the values of $1 and $2.
-
- 3.2. Variables, Expressions, and Assignments
-
- Awk variables take on numeric (floating point) or string values
- according to context. For example, in
-
- x = 1
-
- x is clearly a number, while in
-
- x = "smith"
-
- it is clearly a string. Strings are converted to numbers and vice versa
- whenever context demands it. For instance,
-
- x = "3" + "4"
-
- assigns 7 to x. Strings which cannot be interpreted as numbers in a
- numerical context will generally have numeric value zero, but it is unwise
- to count on this behavior.
-
- By default, variables (other than built-ins) are initialized to the
- null string, which has numerical value zero; this eliminates the need for
- most BEGIN sections. For example, the sums of the first two fields can be
- computed by
-
- { s1 += $1; s2 += $2 }
- END { print s1, s2 }
-
- Arithmetic is done internally in floating point. The arithmetic
- operators are +, -, *, /, and % (mod). The C increment ++ and decrement --
- operators are also available, and so are the assignment operators +=, -=,
- *=, /=, and %=. These operators may all be used in expressions.
-
- 3.3. Field Variables
-
- Fields in awk share essentially all of the properties of variables -
- they may be used in arithmetic or string operations, and may be assigned
- to. Thus one can replace the first field with a sequence number like this:
-
- { $1 = NR; print }
-
- or accumulate two fields into a third, like this:
-
- { $1 = $2 + $3; print $0 }
-
- or assign a string to a field:
-
- { if ($3 > 1000)
- $3 = "too bit"
- print
- }
-
- which replaces the third field by "too big" when it is, and in any case
- prints the record.
-
- Field references may be numerical expressions, as in
-
- { print $i, $(i+1), $(i+n) }
-
- Whether a field is deemed numeric or string depends on context; in
- ambiguous cases like
-
- if ($1 == $2) ...
-
- fields are treated like strings.
-
- Each input line is split into fields automatically as necessary. It
- is also possible to split an variable or string into fields:
-
- n = split(s, array, sep)
-
- splits the string s into array[1], ..., array[n]. The number of elements
- found is returned. If the sep argument is provided, it is used as the
- field separator; otherwise FS is used as the separator.
-
- 3.4. String Concatenation
-
- Strings may be concatenated. For example
-
- length($1 $2 $3)
-
- returns the length of the first three fields. Or in a print statement,
-
- print $1 " is " $2
-
- prints the two fields separated by " is ". Variables and numeric
- expressions may also appear in concatenations.
-
- 3.5. Arrays
-
- Array elements are not declared; they spring into existence by being
- mentioned. Subscripst may have any non-null value, including non-numeric
- strings. As an example of a conventional numeric subscript, the statement
-
- x[NR] = $0
-
- assigns the current input record to the NR-th element of the array x. In
- fact, it is possible in principle (although perhaps slow) to process the
- entire input in a random order with the awk program
-
- { x[NR] = $0 }
- END { ... program ... }
-
- The first actio merely records each input line in the array x.
-
- Array elements may be named by non-numeric values, which gives awk the
- capability rather like the associative memory of Snobol tables. Suppose
- the input contains fields with values like apple, orange, etc. Then the
- program
-
- /apple/ { x["apple"]++ }
- /orange/ { x["orange"]++ }
- END { print x["apple"], x["orange"] }
-
- increments counts for the named array elements, and prints them at the end
- of the input.
-
- 3.6. Flow-of-Control Statements
-
- Awk provides the basic flow-of-control statements if-else, while, for,
- and statement grouping within braces, as in C. We showed the if statement
- in section 3.3 without describing it. The condition i parentheses is
- evaluated; if it is true, the statement following the if is done. The else
- part is optional.
-
- The while statement is exactly like that of C. For example, to print
- all input fields one per line,
-
- i = 1
- while (i <= NF) {
- print $i
- ++i;
- }
-
- The for statement is exactly that of C;
-
- for (i = 1; i <= NF; i++ )
- print $i
-
- does the same job as the while statement above.
-
- There is an alternate form of the for statement which is suited for
- accessing the elements of an associative array:
-
- for (i in array)
- statement
-
- does statement with i set in turn to each element of array. The elements
- are accessed in an apparently random order. Chaos will ensue if i is
- altered, or if any new elements are accessed during the loop.
-
- The expression in the condition part of an if, while or for can
- include relational operators like <, <=, >, >=, == ("is equal to"), and !=
- ("not equal to"); regular expression matches with the match operators ~ and
- !~; the logical operators ||, &&, and !; and of course parentheses for
- grouping.
-
- The break statement causes an immediate exit from an enclosing while
- or for; the continue statement causes the next iteration to begin.
-
- The statement next causes awk to skip immediately to the next record
- and begin scanning from the top. The statement exit causes the program to
- behave as if the end of the input had occured.
-
- Comments may be placed in awk programs; they begin with the character
- # and end with the end of the line, as in
-
- print x, y # this is a comment
-
- 4. Design
-
- The UNIX system already provides several programs that operate by
- passing input through a selection mechanism. Grep, the first and simplest,
- merely prints all lines which match a single specified pattern. Egrep
- provides more general patterns, i.e., regular expressions in full
- generality; fgrep searches for a set of keywords with a particularly fast
- algorithm. Sed provides most of the editing facilities of the editor ed,
- applied to a stream of input. None of these programs provides numeric
- capabilities, logical relations, or variables.
-
- Lex provides general regular expression recognition capabilities, and,
- by serving as a C program generator, is essentially open-ended in its
- capabilities. The use of lex, however, requires a knowledge of C
- programming, and a lex program must be compiled and loaded before use,
- which discourages its use for one-shot applications.
-
- Awk is an attempt to fill in another part of the matrix of
- possibilities. It provides general regular expression capabilities and an
- implicit input/output loop. But it also provides convenient numeric
- processing, variables, more general selection, and control flow in the
- actions. It does not rerquire compilation or a knowledge of C. Finally
- awk provides a convenient way to access fields within lines; it is unique
- in this respect.
-
- Awk also tries to integrate strings and numbers completely, by
- treating all quantities as both string and numeric, deciding which
- representation is appropriate as late as possible. In most cases the user
- can simply ignore the differences.
-
- Most of the effort in developing awk went into deciding what awk
- should or should not do (for instance, it doesn't do string substitution)
- and what the syntax should be (no explicit operator for concatenation)
- rather than on writing or debugging the code. We have tried to make the
- syntax powerful but easy to use and well adapted to scanning files. For
- example, the absence of declarations and implicit initializations, while
- probably a bad idea for a general-purpose programming language, is
- desirable in a language that is meant to be used for tiny progrmas that may
- even be composed on the command line.
-
- In practice, awk usage seems to fall into two broad catagories. One
- is what might be called "report generation" - processing an input to
- extract counts, sums, sub-totals, etc. This also includes the writing of
- trivial data validation programs, such as verifying that a field contains
- only numeric informataion or that certain delimiters are properly balanced.
- The combination of textual and numeric processing is invaluable here.
-
- A second area of use is as a data transformer, converting data from
- the form produced by one program into that expected by another. The
- simplest examples merely select fields, perhaps with rearrangements.
-
- 5. Implementation
-
- The acutal implementation of awk uses the language development tools
- available on the UNIX operating system. The grammar is specified with
- yacc[4]; the lexical analysis is done by lex; the regular expression
- recognizers are deterministic finite automata constructed directly from the
- expressions. An awk program is translated into a parse tree which is then
- executed by a simple interpreter.
-
- Awk was designed for ease of use rather than processing speed; the
- delayed evaluation of variable types and the necessity to break input into
- fields makes high speed difficult to achieve in any case. Nontheless, the
- program has not proven to be unworkably slow.
-
- Table 1 below shows the execution (user + system) time on a PDP-11/70
- of the UNIX programs wc, grep, egrep, fgrep, sed, lex, and awk on the
- following simple tasks;
-
- 1. count the number of lines.
-
- 2. print all lines containing "doug"
-
- 3. print all lines containing "doug", "ken", or "dmr".
-
- 4. print the third field of each line.
-
- 5. print the third and second fields of each line in that order
-
- 6. append all lines containing "doug", "ken", and "dmr" to files
- "jdoug", "jken", and "jdmr", respectivly.
-
- 7. print each line prefixed by "line-number:".
-
- 8. sum the fourth column of a table.
-
- The program wc merely counts words, lines and characters in its input; we
- have already mentioned the others. In all cases the input was a file
- containing 10,000 lines as created by the command ls -l; each line has the
- form
-
- -rw-rw-rw- 1 ava 123 Oct 15 17:05 xxx
-
- The total length of this input is 452,960 characters. Times for lex do not
- include compile or load.
-
- As might be expected, awk is not as fast as the specialized tools wc,
- sed, or the programs in the grep family, but is faster than the more
- general tool lex. In all cases, the tasks were about as easy to express as
- awk programs as programs in these other languages; tasks involving fields
- were conciderable easier to express as awk programs. Some of the test
- programs are shown in awk, sed and lex.
-
- [*] UNIX is a Trademark of Bell Laboratories
-
- References
-
- [1] K. Thompson and D. M. Ritchie, UNIX Programmers Manual.
- Bell Laboratories (May 1975). Sixth Edition
-
- [2] B. W. Kernighan and D. M. Ritchie, The C Programming Language,
- Prentice-Hall, Englewood Cliffs, New Jersey (1978)
-
- [3] M. E. Lesk, "Lex - A Lexical Analyzer Generator,"
- Comp. Sci. Tech. Rep. No. 39,
- Bell Laboratories, Murray Hill, New Jersey (October 1975)
-
- [4] S. C. Johnson, "Yacc - Yet Another Compiler-Compiler,"
- Comp. Sci. Tech. Rep. No. 32,
- Bell Laboratories, Murray Hill, New Jersey (July 1975)
-
-
- Task
-
- program 1 2 3 4 5 6 7 8
-
- wc 8.6
- grep 11.7 13.1
- egrep 6.2 11.5 11.6
- fgrep 7.7 13.8 16.1
- sed 10.2 11.6 15.8 29.0 30.5 16.1
- lex 65.1 150.1 144.2 67.7 70.3 104.0 81.7 92.8
- awk 15.0 25.6 29.9 33.3 38.9 46.4 71.4 31.1
-
- Table 1. Execution Times of Programs (Times are in sec.)
-
-
- The programs for some of these jobs are shown below. The lex programs
- are generally too long to show.
-
- AWK:
-
- 1. END { print NR }
-
- 2. /doug/
-
- 3. /ken|doug|dmr/
-
- 4. { print $3 }
-
- 5. { print $3, $2 }
-
- 6. /ken/ { print >"jken" }
- /doug/ { print >"jdoug" }
- /dmr/ { print >"jdmr" }
-
- 7. { print NR ": " $0 }
-
- 8. { sum = sum + $4 }
- END { print sum }
-
- SED:
-
- 1. $=
-
- 2. /doug/p
-
- 3. /doug/p
- /doug/d
- /ken/p
- /ken/d
- /dmr/p
- /dmr/d
-
- 4. /[^ ]* [ ]*[^ ]* [ ]*\([^ ]*\) .*/s//\1/p
-
- 5. /[^ ]* [ ]*\([^ ]*\) [ ]*\([^ ]*\) .*/s//\2 \1/p
-
- 6 /ken/w jken
- /doug/w jdoug
- /dmr/w jdmr
-
- LEX:
-
- 1. %{
- int i;
- %}
- %%
- \n i++;
- . ;
- %%
- yywrap() {
- printf("%d\n", i);
- }
-
- 2. %%
- ^.*doug.*$ printf("%s\n", yytext);
- . ;
- \n ;
-
-
-
-
-
-
-
- AWK 1985
-
- Rob Duff 01-Mar-88
- Vancouver BC Canada V5L 1E5
-
-
-
-
-
-
- In 1985 Aho, Weinberger and Kernighan revised the definition of the AWK
- programming language. Some of the additions include user defined functions,
- access to command line arguments, more arithmetic functions, and separate
- input. These changes reflected the use that the authors found other people
- making of the language.
-
- ----------------end-of-author's-documentation---------------
-
- Software Library Information:
-
- This disk copy provided as a service of
-
- The Public (Software) Library
-
- We are not the authors of this program, nor are we associated
- with the author in any way other than as a distributor of the
- program in accordance with the author's terms of distribution.
-
- Please direct shareware payments and specific questions about
- this program to the author of the program, whose name appears
- elsewhere in this documentation. If you have trouble getting
- in touch with the author, we will do whatever we can to help
- you with your questions. All programs have been tested and do
- run. To report problems, please use the form that is in the
- file PROBLEM.DOC on many of our disks or in other written for-
- mat with screen printouts, if possible. The P(s)L cannot de-
- bug programs over the telephone.
-
- Disks in the P(s)L are updated monthly, so if you did not get
- this disk directly from the P(s)L, you should be aware that
- the files in this set may no longer be the current versions.
-
- For a copy of the latest monthly software library newsletter
- and a list of the 1,900+ disks in the library, call or write
-
- The Public (Software) Library
- P.O.Box 35705 - F
- Houston, TX 77235-5705
- (713) 665-7017
-